1 Introduction

We are using the same dataset from our first project. The dataset contains 9578 observations and 14 variables and is based on loans from LendingClub, a peer-to-peer lending company, that were made from 2007 to 2015. In summary, our EDA found:

  1. The credit underwriting criteria of LendingClub is proven to be effective as borrowers who do not meet the credit underwriting criteria are more than twice as likely to default in comparison to borrowers who do meet the criteria.
  2. There are some proven relationships between credit.policy and other numeric variables such as int.rate, fico, and inq.last.6mths. Additionally, there was no clearly established relationships between not.fully.paid and other numeric variables (except for credit.policy).
  3. There are statistically significant relationships between the categorical and logical variables purpose, credit.policy, and not.fully.paid.
  4. For all numeric variables except for delinq.2yrs, their mean significantly varies for different categories ofpurpose.

We obtained the dataset from Kaggle here: https://www.kaggle.com/datasets/urstrulyvikas/lending-club-loan-data-analysis

Our work is stored on our team GitHub here: https://github.com/jschild01/JMB_DATS_6101

2 Preparing the Dataset for Modeling

2.1 Converting Variables

From our EDA in project 1 we know that the credit.policy and not.fully.paid variables function as logicals, and the purpose variable functions as a factor with 7 levels. We will start by formally converting these variables from numerical date-types in R to logicals and the factor date-type, respectively.

# We determined this makes sense to do from our EDA
str(loans) # keep??
loans$credit.policy <- as.logical(loans$credit.policy)
loans$not.fully.paid <- as.logical(loans$not.fully.paid)

loans$purpose <- as.factor(loans$purpose)

Of the 14 variables in the dataset, 11 are numeric, 2 are logical, and 1 is a factor.

2.2 Adding Additional Logicals

From our EDA we saw that the overwhelming majority of loans had a value of 0 for delinq.2yrs and pub.rec. The variables might be more useful in terms of having stronger correlations and a larger impact in models if we create logical versions of these variables. We will introduce has.delinq.2yrs and has.pub.rec based on whether or not the value is 0 or greater than 0.

# It might help to turn these variables into logicals since most are 0
loans <- loans %>%
  mutate(has.delinq.2yrs = delinq.2yrs > 0,
         has.pub.rec = pub.rec > 0)

2.3 Revolving Balance Transformation

From our EDA we saw that the revol.bal variable had a wide range of values as well as outlier issues. We wondered if it would work better if we took the natural log of it, similar to how the income provided in the dataset the dataset (the log.annual.inc variable) is in the form of the natural log. However, some values for revol.bal are 0, and taking the natural log of that results in -Inf. One way to deal with this without removing those values is to add 1 to all of the values for revol.bal before taking the natural log.

Given the range of these values (the IQr is 1.50625^{4}) adding one should have a negligible effect, and adding one to all values keeps the data consistent. In case it is significant that the revol.bal is 0 we will add a logical variable has.revol.bal based on whether the value of revol.bal is 0 or greater than 0.

# Some loans have a revol.bal value of 0, the natural log of which is -Inf
# Adding 1 to all values before taking the natural log resolves this issue by nudging the value a very small amount
# In case it was meaningful if the loan had a revol.bal value of 0 versus >0, we can add a logical to account for that
loans <- loans %>%
  mutate(log.revol.bal = log(revol.bal+1),
         has.revol.bal = revol.bal > 0)

2.4 Utility of the new variables

Now that we have added some new variables, it would be helpful to look at a correlation plot and make some comparisons to assess whether they would be more useful going forward than their original forms.

# A new correlation matrix with the new and transformed variables
# For our correlation matrix we want to include everything but the purpose variable
# We can put the new variables together at the top
loans_correlation_matrix_new <- loans %>%
  select(-purpose) %>%
  select(revol.bal, log.revol.bal, has.revol.bal, has.delinq.2yrs, has.pub.rec, everything()) %>%
  cor()

# The mixed correlation plot makes a nice visualization
corrplot.mixed(loans_correlation_matrix_new, tl.pos = "lt")

Some notable observations of this: 1. has.delinq.2yrs doesn’t have any strong correlations, and the 2 potentially useful ones (int.rate and fico) are almost exactly the same as delinq.2yrs. 2. Similarly, has.pub.rec doesn’t have any strong correlations, and the 2 potentially useful ones (int.rate and fico) are the same as pub.rec. 3. log.revol.bal has a stronger correlation with dti and a much stronger correlation with revol.util compared to revol.bal. 4. log.revol.bal has a weaker correlation with credit.policy compared to revol.bal. 5. has.revol.bal does not have a significant correlation with anything (log.revol.bal excepted) except revol.util, where the correlation is significantly weaker than that of log.revol.bal.

From this we can conclude that converting delinq.2yrs and pub.rec to logicals did not add any advantage for our dataset. We can also conclude that the has.revol.bal variable is not necessary and will not add value to our analysis beyond what log.revol.bal already covers. Successfully eliminating these as possibilities increases our confidence in the use of their original structures.

Lastly, we see that taking the natural log of revol.bal might be useful, as log.revol.bal has some stronger correlations than revol.bal. We will note this as we proceed with our modeling.

2.5 Updated Visuals

These are updated histograms, boxplots, and Q-Q plots with the log.revol.bal added variable. We will use these to briefly review the EDA we did for the first project.

2.5.1 Histograms

2.5.2 Boxplots

2.5.3 Q-Q Plots

3 Classification Tree

3.1 Preparing and Building the Tree

Since one of our variables of intrest, credit.policy is a logical, we can treat is as a 2-level factor and build a classification tree model using the variables it has significant correlation with, in this case int.rate, fico, and inq.last.6mths.

For this variable we are interested in the loans where the value is FALSE. That is, loans that do not meet the credit underwriting criteria, and therefore are more at risk of defaulting. These loans are in the minority. We will rename the FALSE values to Fail, indicating that the loan fails to meet the underwriting criteria, and rename the TRUE values to Meets, indicating that the loan meets the underwriting criteria.

Building the classification tree, we can see that int.rate was not used in the model, but fico, and inq.last.6mths were.

# Since we are renaming some values let's make a separate dataset for this model
loans_tree <- loans

# First convert the credit.policy variable from a logical to a factor, then rename the levels
loans_tree$credit.policy <- as.factor(loans_tree$credit.policy) %>%
  fct_recode(Meets = "TRUE", Fails = "FALSE")

# Max depth of 4 based on Professor Faruqe's instructions
creditfit <- rpart(credit.policy ~ int.rate + fico + inq.last.6mths, data=loans_tree, method="class", control = list(maxdepth = 4) )
printcp(creditfit) # display the results 
## 
## Classification tree:
## rpart(formula = credit.policy ~ int.rate + fico + inq.last.6mths, 
##     data = loans_tree, method = "class", control = list(maxdepth = 4))
## 
## Variables actually used in tree construction:
## [1] fico           inq.last.6mths
## 
## Root node error: 1868/9578 = 0.19503
## 
## n= 9578 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.484475      0   1.00000 1.00000 0.020759
## 2 0.187366      1   0.51552 0.51552 0.015755
## 3 0.076017      2   0.32816 0.32816 0.012823
## 4 0.010000      3   0.25214 0.25214 0.011329
plotcp(creditfit) # visualize cross-validation results

# Show these results??
summary(creditfit) # detailed summary of splits

To add: comments on results

3.2 Visualizing the Tree

We can better visualize the tree by plotting it out. We can see the inq.last.6mths threshold of greater than or equal to four, as well as the two thresholds of 740 and 660 or FICO scores. It details that of our 9578 loan sample, there are 1868 loans that failed to meet the credit.policy and 7710 that did.

# plot the tree and add text
rpart.plot(creditfit, uniform=TRUE, main="Classification Tree for Credit.Policy", digits = 3, extra = 1)

#plot(creditfit, uniform=TRUE, main="Classification Tree for credit.policy")
#text(creditfit, use.n=TRUE, all=TRUE, cex=.75)

Of all of the loans, 1231 had more greater than or equal to four inq.last.6mths. Of those 1231:

  • 1047 had a FICO lower than 740 and failed to meet the credit.policy
  • 21 had a FICO score 740 or higher and failed to meet the credit.policy
  • 163 had a FICO score 740 or higher and met the credit.policy

This indicates that applicants will likely need a FICO score of 740 or higher if they have greater than or equal to four inq.last.6mths in order to anticipate meeting the credit.policy.

Conversely, of all of the loans, 8347 had less than four inq.last.6mths. Of those 8347:

  • 352 had a FICO lower than 660 and failed to meet the credit.policy
  • 2 had a FICO lower than 660 but still managed to meet the credit.policy
  • 448 had a FICO score 660 or higher and failed to meet the credit.policy
  • 7545 had a FICO score 660 or higher and met the credit.policy

This indicates that applicants will likely need a FICO score of only 660 or higher if they have less than four inq.last.6mths in order to anticipate meeting the credit.policy.

Overall, we can see that less inq.last.6mths aligns with lower FICO score requirements for the credit.policy. Generally, if applicants have equal to or more than four inq.last.6mths, then they will likely need a FICO score of 740 or higher to meet the credit.policy. On the other hand, if applicants have less than four inq.last.6mths, then they will likely only need a FICO score of 660 or higher to meet the credit.policy. If applicants have a FICO score lower than 660, then they are unlikely to meet the credit.policy, regardless.

# create a postscript plot of tree 
# The rpart.plot() function looks quite nice, so we may not need this
post(creditfit, file = "credittree2.ps", title = "Classification Tree for credit.policy")

3.3 Confusion Matrix

To better understand how our model performs we can construct a confusion matrix of the results. The Fails values will function as the positive cases for the confusion matrix we put together based on the results of the classification tree model.

# Creating the confusion matrix
confusion_matrix = confusionMatrix(predict(creditfit, type = "class"), reference = loans_tree$credit.policy)
print('Overall: ')
## [1] "Overall: "
confusion_matrix$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   9.508248e-01   8.269988e-01   9.463031e-01   9.550698e-01   8.049697e-01 
## AccuracyPValue  McnemarPValue 
##   0.000000e+00  2.836114e-102
print('Class: ')
## [1] "Class: "
confusion_matrix$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.7489293            0.9997406            0.9985724 
##       Neg Pred Value            Precision               Recall 
##            0.9426440            0.9985724            0.7489293 
##                   F1           Prevalence       Detection Rate 
##            0.8559192            0.1950303            0.1460639 
## Detection Prevalence    Balanced Accuracy 
##            0.1462727            0.8743350
#The Kappa statistic (or value) is a metric that compares an Observed Accuracy with an Expected Accuracy (random chance). The kappa statistic is used not only to evaluate a single classifier, but also to evaluate classifiers amongst themselves.According to their scheme a value < 0 is indicating no agreement , 0–0.20 as slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1 as almost perfect agreement. 

Based on the Kappa statistic of 0.83 we would evaluate this classifier as almost perfect agreement.

To add: other comments on the overall stats of the model.

From the Trees.rmd file: The overall accuracy is 95.08%. These are the same metrics of sensitivity (also known as recall rate, TP / (TP+FN) ), specificity (TN / (TN+FP) ), F1 score, and others that we used in Logistic Regression and KNN analyses. Indeed, any “classifiers” can use the confusion matrix approach as one of the evaluation tools.

Let’s put together tables to visualize the confusion matrix as well some statistics on it.

# Creating the confusion matrix table
# To get a table that follows the same format as our notes, with the rows representing true values and columns representing predicted values, we will have to adjust the data a bit.

# Start by converting to a data.frame
cm_table <- as.data.frame(confusion_matrix$table) %>%
  rename(Actual = "Reference")

# Make it so that the rows represent true values and columns represent predicted values
cm_table$Prediction <- paste0("Prediction - ", cm_table$Prediction)
cm_table <- spread(cm_table, Prediction, Freq)

# Output a nice table
# xkabledply(cm_table, title = "Confusion matrix for the tree model")
cm_table %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)
Actual Prediction - Fails Prediction - Meets
Fails 1399 469
Meets 2 7708
## Can we bold the values in the first column ("Actual") using kable??
#Creating a table for the confusion matrix statistics
# Start by converting to a data.frame
cm_stats <- as.data.frame(confusion_matrix$byClass) %>%
  rownames_to_column()

# Adjust the column names and values to look better
colnames(cm_stats) <- c("Statistic", "Percentage")
cm_stats$Percentage <- paste0(round(cm_stats$Percentage*100, digits=1), "%")

# Output a nice table
# xkabledply(cm_stats, title = "Confusion matrix statistics for the tree model")
cm_stats %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)
Statistic Percentage
Sensitivity 74.9%
Specificity 100%
Pos Pred Value 99.9%
Neg Pred Value 94.3%
Precision 99.9%
Recall 74.9%
F1 85.6%
Prevalence 19.5%
Detection Rate 14.6%
Detection Prevalence 14.6%
Balanced Accuracy 87.4%

For certain statistics this model performs extremely well, with near 100% specificity, precision, and positive predictive value. Negative predictive value is also very high at around 94%.

To add: further comments on the statistics of the model.

3.4 ROC Curve

Can we do this for this type of model?

library(pROC)

# loans_tree$prediction <- predict(creditfit, type = "vector")
loans_tree$prediction <- predict(creditfit, type = "prob")[,2]
tree_roc <- roc(credit.policy ~ prediction, data=loans_tree)
auc(tree_roc) # area-under-curve prefer 0.8 or higher.
plot(tree_roc)

The AUC of the ROC is 0.8773732, which is greater than the 0.8 threshold, so we would consider this to be a good indicator/model.

4 Simple Linear Regression

4.1 Interest Rate vs FICO Score

# fit linear model
linear_model <- lm(int.rate ~ fico, data=loans)
  
# view summary of linear model
summary(linear_model)

# scatterplot with regression line and confidence interval
# scatterplot(int.rate ~ fico, data = loans)

loans %>%
  ggplot(aes(x = fico, y = int.rate)) +
  geom_point(color = "steelblue", alpha = 0.2) +
  geom_smooth(method = "lm", se = T) +
  labs(title = "Interest Rate vs FICO Score",
       x = "FICO Score", y = "Interest Rate") +
  scale_x_continuous(limits = c(600, NA), expand = expansion(mult = c(0, .05))) +
  scale_y_continuous(labels = label_percent(), limits = c(.05, NA), expand = expansion(mult = c(0, .05))) +
  theme_minimal()

4.2 Revolving Utilization Rate vs Interest Rate

# fit linear model
linear_model <- lm(revol.util ~ int.rate, data=loans)
  
# view summary of linear model
summary(linear_model)

# scatterplot with regression line and confidence interval
# scatterplot(revol.util ~ int.rate, data = loans)

loans %>%
  ggplot(aes(x = int.rate, y = revol.util)) +
  geom_point(color = "steelblue", alpha = 0.2) +
  geom_smooth(method = "lm") +
  labs(title = "Revolving Line Utilization Rate vs Interest Rate",
       x = "Interest Rate", y = "Revolving Line Utilization Rate") +
  scale_x_continuous(labels = label_percent(), limits = c(.05, NA), expand = expansion(mult = c(0, .05))) +
  scale_y_continuous(labels = label_percent(scale = 1)) +
  theme_minimal()

4.3 Installment vs Natural Log of Annual Income

# fit linear model
linear_model <- lm(installment ~ log.annual.inc, data=loans)
  
# view summary of linear model
summary(linear_model)

# scatterplot with regression line and confidence interval
# scatterplot(installment ~ log.annual.inc, data = loans)

loans %>%
  ggplot(aes(x = log.annual.inc, y = installment)) +
  geom_point(color = "steelblue", alpha = 0.2) +
  geom_smooth(method = "lm") +
  labs(title = "Installment vs Log of Annual Income",
       x = "Log of Annual Income", y = "Installment") +
  theme_minimal()

4.4 Revolving Utilization Rate vs FICO Score

# fit linear model
linear_model <- lm(revol.util ~ fico, data=loans)
  
# view summary of linear model
summary(linear_model)

# scatterplot with regression line and confidence interval
# scatterplot(revol.util ~ fico, data = loans)

loans %>%
  ggplot(aes(x = fico, y = revol.util)) +
  geom_point(color = "steelblue", alpha = 0.2) +
  geom_smooth(method = "lm") +
  labs(title = "Revolving Line Utilization Rate vs FICO Score",
       x = "FICO Score", y = "Revolving Line Utilization Rate") +
  scale_x_continuous(limits = c(600, NA), expand = expansion(mult = c(0, .05))) +
  scale_y_continuous(labels = label_percent(scale = 1)) +
  theme_minimal()

5 Multiple Linear Regression

set.seed(7)
# load the library
library(mlbench)
library(caret)
# load the dataset
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
class(loans$not.fully.paid)
model <- train(int.rate~., data=loans, method="lm", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)

#names(getModelInfo())
model <- lm(int.rate ~ fico+installment+purpose+dti+inq.last.6mths+credit.policy+log.annual.inc, data = loans)
summary(model)
model$coefficients
model$call
coef(model)
##               (Intercept)                      fico               installment 
##              0.4819228196             -0.0005143307              0.0000428540 
##        purposecredit_card purposedebt_consolidation        purposeeducational 
##             -0.0033260863             -0.0012546309              0.0001820383 
##   purposehome_improvement     purposemajor_purchase     purposesmall_business 
##              0.0019883173              0.0014739586              0.0157732687 
##                       dti            inq.last.6mths         credit.policyTRUE 
##              0.0001682692              0.0005470733             -0.0020864159 
##            log.annual.inc 
##             -0.0008158434
confint(model)
##                                   2.5 %        97.5 %
## (Intercept)                4.731186e-01  4.907271e-01
## fico                      -5.236009e-04 -5.050605e-04
## installment                4.107866e-05  4.462934e-05
## purposecredit_card        -4.423902e-03 -2.228270e-03
## purposedebt_consolidation -2.099536e-03 -4.097257e-04
## purposeeducational        -1.609548e-03  1.973625e-03
## purposehome_improvement    5.855629e-04  3.391072e-03
## purposemajor_purchase     -1.358131e-04  3.083730e-03
## purposesmall_business      1.434543e-02  1.720111e-02
## dti                        1.199051e-04  2.166333e-04
## inq.last.6mths             3.765454e-04  7.176012e-04
## credit.policyTRUE         -3.076261e-03 -1.096571e-03
## log.annual.inc            -1.402356e-03 -2.293305e-04
library("broom")

tidyfinal <-  tidy(model)
tidyfinal

Model_Summary <- augment(model)
str(Model_Summary)
head(Model_Summary)
par(mfrow=c(2,2))
plot(model)

vif(model)
lmtest::bptest(model)
car::ncvTest(model)
distBCMod <- caret::BoxCoxTrans(loans$int.rate)
print(distBCMod)
loans <- cbind(loans, dist_new=predict(distBCMod, loans$int.rate)) 
head(loans) 
mod1 <- lm(dist_new ~ fico+installment+purpose+revol.util+log.revol.bal+credit.policy+log.annual.inc, data=loans)
summary(mod1)
lmtest::bptest(mod1)
par(mfrow=c(2,2))
plot(mod1)

6 Logistic Regression Credit Policy

6.1 Effects on Admission by purpose

To study the effects on admission by the factor purpose (credit.policy and purpose are both categorical variables), wee can create two-way contingency table of the outcome and predictors, and make sure there are no cells of zero frequencies.
*A contingency table, sometimes called a two-way frequency table, is a tabular mechanism with at least two rows and two columns used in statistics to present categorical data in terms of frequency counts.

credit.policypurposetable = xtabs(~ credit.policy + purpose, data = loans)
credit.policypurposetable

6.2 feature selection

set.seed(7)
# load the library
library(mlbench)
library(caret)
# load the dataset
# prepare training scheme
loans$credit.policy = as.factor(loans$credit.policy)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
class(loans$credit.policy)
model <- train(credit.policy~., data=loans, method="glm", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)

#names(getModelInfo())

6.3 Logistic Regression Model

<<<<<<< Updated upstream

loans$credit.policy <- factor(loans$credit.policy)
str(loans)
loans$purpose <- factor(loans$purpose)
#str(credit.policy)
credit.policyLogit <- glm(credit.policy ~ inq.last.6mths+fico+log.annual.inc+dti+delinq.2yrs+pub.rec+purpose , data = loans, binomial(link = "logit") )  

We can see the summary of the logit model here:

summary(credit.policyLogit)
xkabledply(credit.policyLogit, title = "Logistic Regression :")
Logistic Regression :
Estimate Std. Error z value Pr(>|z|)
(Intercept) -23.2914 1.0717 -21.7330 0.0000
inq.last.6mths -0.8452 0.0223 -37.8305 0.0000
fico 0.0357 0.0013 26.7228 0.0000
log.annual.inc 0.1406 0.0578 2.4337 0.0149
dti -0.0205 0.0053 -3.8755 0.0001
delinq.2yrs -0.0255 0.0555 -0.4602 0.6454
pub.rec 0.2791 0.1265 2.2056 0.0274
purposecredit_card 0.0268 0.1188 0.2255 0.8216
purposedebt_consolidation 0.4081 0.0894 4.5636 0.0000
purposeeducational -0.0394 0.1848 -0.2131 0.8312
purposehome_improvement 0.2858 0.1609 1.7764 0.0757
purposemajor_purchase 0.3689 0.1927 1.9141 0.0556
purposesmall_business 0.1689 0.1557 1.0847 0.2780

Before moving on, let us look at the model object credit.policyLogit a little deeper. The fitted values can be found from credit.policyLogit$fitted.values. And the first fitted value is 0.9904137. This is the probability of being credit.policyted for data point #1. Compare to the value from predict(credit.policyLogit) to be 4.6377906.

The predict() function gives us the logit value. You can exponentiate to get the odds ratio p/q as 103.315827. And finally, we can find p from p/q, and indeed it is confirmed to be 0.9904137.

The easier way to get that is simply use predict(credit.policyLogit, type = c("response"))[1] = 0.9904137. The predict() function will also allow you to find model prediction with unseen/untrained data points where fitted.values do not give.

p_fitted = credit.policyLogit$fitted.values[1] # this is the model predicated value p-hat for the first data row (not the actual data point p)  

This is stored in the model as the fitted value for the probability p of the first data point. Since the actual data point is a 0/1 True/False value, it is not easy to directly compare them unless we use a cutoff value (default as 0.5) to convert the probability p to 0/1.

Now, for unseen data point, we can use the predict( ) function to find the model predicted values. But be careful of what you are getting with the predict() function in classification models. Let us compare the three options below. For easy comparison, let us use the same data point in the dataset as an example.

# This gives you the predicted values of the data points inside the model.
predict(credit.policyLogit)  # the is from the model, which gives you the value for logit(p) or ln(p/q) 

6.3.1 Confidence Intervals

We can easily determine the confidence intervals of each coefficient with these two slightly different ways:

<<<<<<< Updated upstream

## CIs using profiled log-likelihood
# confint(credit.policyLogit)
xkabledply( confint(credit.policyLogit), title = "CIs using profiled log-likelihood" )
CIs using profiled log-likelihood
2.5 % 97.5 %
(Intercept) -25.4097 -21.2079
inq.last.6mths -0.8895 -0.8020
fico 0.0331 0.0384
log.annual.inc 0.0275 0.2539
dti -0.0309 -0.0101
delinq.2yrs -0.1330 0.0847
pub.rec 0.0355 0.5317
purposecredit_card -0.2052 0.2606
purposedebt_consolidation 0.2326 0.5832
purposeeducational -0.3970 0.3282
purposehome_improvement -0.0260 0.6051
purposemajor_purchase -0.0020 0.7543
purposesmall_business -0.1333 0.4773
## CIs using standard errors
# confint.default(credit.policyLogit)
xkabledply( confint.default(credit.policyLogit), title = "CIs using standard errors" )
CIs using standard errors
2.5 % 97.5 %
(Intercept) -25.3919 -21.1909
inq.last.6mths -0.8890 -0.8014
fico 0.0331 0.0383
log.annual.inc 0.0274 0.2538
dti -0.0309 -0.0101
delinq.2yrs -0.1343 0.0832
pub.rec 0.0311 0.5271
purposecredit_card -0.2060 0.2596
purposedebt_consolidation 0.2328 0.5834
purposeeducational -0.4017 0.3229
purposehome_improvement -0.0295 0.6012
purposemajor_purchase -0.0088 0.7467
purposesmall_business -0.1363 0.4741

6.3.2 Model evaluation

6.3.2.1 Confusion matrix

This is just one of the many libraries you can find the confusion matrix. It is easy to use, but not very powerful, lacking ability to choose cutoff value, and it does not give you all the metrics like accuracy, precision, recall, sensitivity, f1 score etc. Nonetheless, it’s handy.

loadPkg("regclass")
# confusion_matrix(credit.policyLogit)
xkabledply( confusion_matrix(credit.policyLogit), title = "Confusion matrix from Logit Model" )
Confusion matrix from Logit Model
Predicted FALSE Predicted TRUE Total
Actual FALSE 1099 769 1868
Actual TRUE 221 7489 7710
Total 1320 8258 9578
unloadPkg("regclass")

6.3.2.2 Receiver-Operator-Characteristic (ROC) curve and Area-Under-Curve (AUC)

Receiver-Operator-Characteristic (ROC) curve and Area-Under-Curve (AUC) measures the true positive rate (or sensitivity) against the false positive rate (or specificity). The area-under-curve is always between 0.5 and 1. Values higher than 0.8 is considered good model fit.

loadPkg("pROC") # receiver operating characteristic curve, gives the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is on sensitivity/recall/true-positive-rate vs false_alarm/false-positive-rate/fall-out.
prob=predict(credit.policyLogit, type = "response" )
loans$prob <- NA
loans$prob=prob
h <- roc(credit.policy~prob, data=loans)
auc(h) # area-under-curve prefer 0.8 or higher.
plot(h)

# unloadPkg("pROC")
head(loans)

We have here the area-under-curve of 0.8873879, which is more than 0.8. This test also agrees with the Hosmer and Lemeshow test that the model is considered a good fit.

6.3.2.3 END

6.3.2.4 McFadden

McFadden is another evaluation tool we can use on logit regressions. This is part of what is called pseudo-R-squared values for evaluation tests. We can calculate the value directly from its definition if we so choose to.

credit.policyNullLogit <- glm(credit.policy ~ 1, data = loans, family = "binomial")
mcFadden = 1 - logLik(credit.policyLogit)/logLik(credit.policyNullLogit)
mcFadden

Or we can use other libraries. The pscl (Political Science Computational Lab) library has the function pR2() (pseudo-\(R^2\)) will do the trick.

loadPkg("pscl") # use pR2( ) function to calculate McFadden statistics for model eval
credit.policyLogitpr2 = pR2(credit.policyLogit)
credit.policyLogitpr2
unloadPkg("pscl") 

With the McFadden value of 0.41736, which is analogous to the coefficient of determination \(R^2\), only about 51% of the variations in y is explained by the explanatory variables in the model.

A major weakness of the overall model is likely from the small dataset sample size of 9578. We expect a much higher number of observations will increase the sensitivity of the model.

6.3.2.5 Hosmer and Lemeshow test

The Hosmer and Lemeshow Goodness of Fit test can be used to evaluate logistic regression fit.

loadPkg("ResourceSelection") # function hoslem.test( ) for logit model evaluation
credit.policyLogitHoslem = hoslem.test(loans$credit.policy, fitted(credit.policyLogit)) # Hosmer and Lemeshow test, a chi-squared test

The result is shown here:

credit.policyLogitHoslem
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  loans$credit.policy, fitted(credit.policyLogit)
## X-squared = 9578, df = 8, p-value < 2.2e-16
# Have not found a good way to display it.

The p-value of 0 is relatively low. This indicates the model is y a good fit, despite all the coefficients are significant.

7 Logistic Regression Not Fully Paid

7.1 Effects on not.fully paid by purpose

To study the effects on admission by the factor purpose (not.fully.paid and purpose are both categorical variables), wee can create two-way contingency table of the outcome and predictors, and make sure there are no cells of zero frequencies.
*A contingency table, sometimes called a two-way frequency table, is a tabular mechanism with at least two rows and two columns used in statistics to present categorical data in terms of frequency counts.

<<<<<<< Updated upstream

not.fully.paidpurposetable = xtabs(~ not.fully.paid + purpose, data = loans)
not.fully.paidpurposetable

7.2 feature selection

set.seed(7)
# load the library
library(mlbench)
library(caret)
loans$not.fully.paid = as.factor(loans$not.fully.paid)
# load the dataset
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
class(loans$not.fully.paid)
model <- train(not.fully.paid~., data=loans, method="glm", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)

#names(getModelInfo())

7.3 Logistic Regression Model

loans$not.fully.paid <- factor(loans$not.fully.paid)
str(loans)
loans$purpose <- factor(loans$purpose)
#str(not.fully.paid)
not.fully.paidLogit <- glm(not.fully.paid ~., data = loans, binomial(link = "logit") )  

We can see the summary of the logit model here:

summary(not.fully.paidLogit)
xkabledply(not.fully.paidLogit, title = "Logistic Regression :")
Logistic Regression :
Estimate Std. Error z value Pr(>|z|)
(Intercept) 73.6933 20.3074 3.6289 0.0003
credit.policyTRUE -0.2791 0.0965 -2.8909 0.0038
purposecredit_card -0.5149 0.1100 -4.6808 0.0000
purposedebt_consolidation -0.3238 0.0789 -4.1045 0.0000
purposeeducational 0.0555 0.1525 0.3643 0.7157
purposehome_improvement 0.1002 0.1269 0.7896 0.4297
purposemajor_purchase -0.3200 0.1669 -1.9174 0.0552
purposesmall_business 0.5667 0.1174 4.8271 0.0000
int.rate -90.5900 28.3969 -3.1901 0.0014
installment 0.0012 0.0002 6.5208 0.0000
log.annual.inc -0.3876 0.0610 -6.3544 0.0000
dti -0.0005 0.0047 -0.1126 0.9104
fico -0.0067 0.0016 -4.2652 0.0000
days.with.cr.line 0.0000 0.0000 0.8006 0.4234
revol.bal 0.0000 0.0000 3.1662 0.0015
revol.util 0.0031 0.0014 2.1652 0.0304
inq.last.6mths 0.0567 0.0216 2.6242 0.0087
delinq.2yrs -0.1113 0.0976 -1.1404 0.2541
pub.rec -1.4521 0.7184 -2.0212 0.0433
has.delinq.2yrsTRUE 0.0511 0.1600 0.3194 0.7494
has.pub.recTRUE 1.9029 0.7399 2.5717 0.0101
log.revol.bal -0.0153 0.0310 -0.4945 0.6209
has.revol.balTRUE -0.1426 0.2792 -0.5108 0.6095
dist_new 50.0380 15.5003 3.2282 0.0012
prob -0.4150 0.2432 -1.7067 0.0879

Before moving on, let us look at the model object not.fully.paidLogit a little deeper. The fitted values can be found from not.fully.paidLogit$fitted.values. And the first fitted value is 0.1374278. This is the probability of being not.fully.paid for data point #1. Compare to the value from predict(not.fully.paidLogit) to be -1.83682.

The predict() function gives us the logit value. You can exponentiate to get the odds ratio p/q as 0.1593233. And finally, we can find p from p/q, and indeed it is confirmed to be 0.1374278.

The easier way to get that is simply use predict(not.fully.paidLogit, type = c("response"))[1] = 0.1374278. The predict() function will also allow you to find model prediction with unseen/untrained data points where fitted.values do not give.

<<<<<<< Updated upstream

p_fitted = not.fully.paidLogit$fitted.values[1] # this is the model predicated value p-hat for the first data row (not the actual data point p)  
p_fitted

This is stored in the model as the fitted value for the probability p of the first data point. Since the actual data point is a 0/1 True/False value, it is not easy to directly compare them unless we use a cutoff value (default as 0.5) to convert the probability p to 0/1.

Now, for unseen data point, we can use the predict( ) function to find the model predicted values. But be careful of what you are getting with the predict() function in classification models. Let us compare the three options below. For easy comparison, let us use the same data point in the dataset as an example.

# This gives you the predicted values of the data points inside the model.
predict(not.fully.paidLogit)  # the is from the model, which gives you the value for logit(p) or ln(p/q) 

7.3.1 Confidence Intervals

We can easily determin the confidence intervals of each coefficient with these two slightly different ways:

## CIs using profiled log-likelihood
# confint(not.fully.paidLogit)
xkabledply( confint(not.fully.paidLogit), title = "CIs using profiled log-likelihood" )
CIs using profiled log-likelihood
2.5 % 97.5 %
(Intercept) 34.3522 113.9860
credit.policyTRUE -0.4672 -0.0887
purposecredit_card -0.7328 -0.3013
purposedebt_consolidation -0.4780 -0.1688
purposeeducational -0.2491 0.3492
purposehome_improvement -0.1518 0.3459
purposemajor_purchase -0.6572 -0.0017
purposesmall_business 0.3354 0.7957
int.rate -146.9097 -35.5536
installment 0.0008 0.0015
log.annual.inc -0.5074 -0.2683
dti -0.0099 0.0088
fico -0.0098 -0.0036
days.with.cr.line 0.0000 0.0000
revol.bal 0.0000 0.0000
revol.util 0.0003 0.0059
inq.last.6mths 0.0148 0.0999
delinq.2yrs -0.3191 0.0651
pub.rec -3.2674 -0.3318
has.delinq.2yrsTRUE -0.2544 0.3741
has.pub.recTRUE 0.7225 3.7449
log.revol.bal -0.0756 0.0459
has.revol.balTRUE -0.6917 0.4032
dist_new 19.9987 80.7807
prob -0.8891 0.0654
## CIs using standard errors
# confint.default(not.fully.paidLogit)
xkabledply( confint.default(not.fully.paidLogit), title = "CIs using standard errors" )
CIs using standard errors
2.5 % 97.5 %
(Intercept) 33.8915 113.4951
credit.policyTRUE -0.4682 -0.0899
purposecredit_card -0.7305 -0.2993
purposedebt_consolidation -0.4783 -0.1692
purposeeducational -0.2433 0.3543
purposehome_improvement -0.1485 0.3488
purposemajor_purchase -0.6472 0.0071
purposesmall_business 0.3366 0.7968
int.rate -146.2469 -34.9331
installment 0.0008 0.0015
log.annual.inc -0.5072 -0.2681
dti -0.0098 0.0088
fico -0.0098 -0.0036
days.with.cr.line 0.0000 0.0000
revol.bal 0.0000 0.0000
revol.util 0.0003 0.0059
inq.last.6mths 0.0143 0.0990
delinq.2yrs -0.3025 0.0800
pub.rec -2.8602 -0.0440
has.delinq.2yrsTRUE -0.2624 0.3646
has.pub.recTRUE 0.4526 3.3531
log.revol.bal -0.0760 0.0454
has.revol.balTRUE -0.6898 0.4046
dist_new 19.6580 80.4180
prob -0.8916 0.0616

7.3.2 Model evaluation

7.3.2.1 Confusion matrix

This is just one of the many libraries you can find the confusion matrix. It is easy to use, but not very powerful, lacking ability to choose cutoff value, and it does not give you all the metrics like accuracy, precision, recall, sensitivity, f1 score etc. Nonetheless, it’s handy.

loadPkg("regclass")
# confusion_matrix(not.fully.paidLogit)
xkabledply( confusion_matrix(not.fully.paidLogit), title = "Confusion matrix from Logit Model" )
Confusion matrix from Logit Model
Predicted FALSE Predicted TRUE Total
Actual FALSE 7999 46 8045
Actual TRUE 1488 45 1533
Total 9487 91 9578
unloadPkg("regclass")

7.3.2.2 Receiver-Operator-Characteristic (ROC) curve and Area-Under-Curve (AUC)

Receiver-Operator-Characteristic (ROC) curve and Area-Under-Curve (AUC) measures the true positive rate (or sensitivity) against the false positive rate (or specificity). The area-under-curve is always between 0.5 and 1. Values higher than 0.8 is considered good model fit.

loadPkg("pROC") # receiver operating characteristic curve, gives the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is on sensitivity/recall/true-positive-rate vs false_alarm/false-positive-rate/fall-out.
prob=predict(not.fully.paidLogit, type = "response" )
loans$prob <- NA
loans$prob=prob
h <- roc(not.fully.paid~prob, data=loans)
auc(h) # area-under-curve prefer 0.8 or higher.
plot(h)

# unloadPkg("pROC")
head(loans)

We have here the area-under-curve of 0.6862751, which is less than 0.8. This test also agrees with the Hosmer and Lemeshow test that the model is not considered a good fit.

7.3.2.3 END

7.3.2.4 McFadden

McFadden is another evaluation tool we can use on logit regressions. This is part of what is called pseudo-R-squared values for evaluation tests. We can calculate the value directly from its definition if we so choose to.

not.fully.paidNullLogit <- glm(not.fully.paid ~ 1, data = loans, family = "binomial")
mcFadden = 1 - logLik(not.fully.paidLogit)/logLik(not.fully.paidNullLogit)
mcFadden

Or we can use other libraries. The pscl (Political Science Computational Lab) library has the function pR2() (pseudo-\(R^2\)) will do the trick.

loadPkg("pscl") # use pR2( ) function to calculate McFadden statistics for model eval
not.fully.paidLogitpr2 = pR2(not.fully.paidLogit)
not.fully.paidLogitpr2
unloadPkg("pscl") 

With the McFadden value of 0.071316, which is analogous to the coefficient of determination \(R^2\), only about 8% of the variations in y is explained by the explanatory variables in the model.

A major weakness of the overall model is likely from the small dataset sample size of 9578. We expect a much higher number of observations will increase the sensitivity of the model.

7.3.2.5 Hosmer and Lemeshow test

The Hosmer and Lemeshow Goodness of Fit test can be used to evaluate logistic regression fit.

loadPkg("ResourceSelection") # function hoslem.test( ) for logit model evaluation
not.fully.paidLogitHoslem = hoslem.test(loans$not.fully.paid, fitted(not.fully.paidLogit)) # Hosmer and Lemeshow test, a chi-squared test

The result is shown here:

not.fully.paidLogitHoslem
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  loans$not.fully.paid, fitted(not.fully.paidLogit)
## X-squared = 9578, df = 8, p-value < 2.2e-16
# Have not found a good way to display it.

The p-value of 0 is relatively high. This indicates the model is not really a good fit, despite all the coefficients are significant.